Natasha 2: Faster Non-Convex Optimization Than SGD
Author
Abstract
We design a stochastic algorithm to train any smooth neural network to ε-approximate local minima, using O(ε^{-3.25}) backpropagations. The previously best result was essentially O(ε^{-4}), by SGD. More broadly, the algorithm finds ε-approximate local minima of any smooth nonconvex function at rate O(ε^{-3.25}), with only oracle access to stochastic gradients.

∗ V1 appeared on arXiv on this date; V2 and V3 polished the writing. This paper builds on, but should not be confused with, the offline method Natasha1 [3], which only finds approximate stationary points. When this manuscript first appeared online, the best rate was indeed T = O(ε^{-4}), by SGD. Several follow-up works appeared after this paper while citing it, including stochastic cubic regularization [47], which gives T = O(ε^{-3.5}) (Nov 2017), and Neon+SCSG [10, 49], which gives T = O(ε^{-3.333}) (Nov 2017); these rates are worse than T = O(ε^{-3.25}). Our original method also requires oracle access to Hessian-vector products, but the follow-up paper of Allen-Zhu and Li [10] lets us replace the Hessian-vector products with stochastic gradient computations; V3 of this manuscript reflects this change.

arXiv:1708.08694v3 [math.OC], 23 Feb 2018
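As a reading aid (this is the standard convention in this literature, not a quotation from the paper, and the exact Hessian tolerance δ varies across works): a point x is an ε-approximate local minimum of a smooth f when

\|\nabla f(x)\| \le \varepsilon \qquad \text{and} \qquad \nabla^2 f(x) \succeq -\delta\, I,

i.e. the gradient is small and the Hessian has no strongly negative eigenvalue; δ is typically a power of ε such as \sqrt{\varepsilon} or \varepsilon^{1/4}.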
Similar Resources
Stochastic Variance Reduction for Nonconvex Optimization
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (Svrg) methods for them. Svrg and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (Sgd); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary...
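The core of SVRG-type methods for such finite-sum problems f(x) = (1/n) Σ_i f_i(x) is a variance-reduced gradient estimator built around a periodic snapshot. Below is a minimal Python sketch of that estimator, assuming a user-supplied grad_i(x, i) for the component gradients; it illustrates the idea only and is not the authors' implementation or tuning.

import numpy as np

def svrg(x0, grad_i, n, lr=0.01, epochs=10, inner_steps=100, rng=None):
    # Minimal SVRG sketch for f(x) = (1/n) * sum_i f_i(x).
    # grad_i(x, i) must return the gradient of the i-th component at x.
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(epochs):
        snapshot = x.copy()
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(inner_steps):
            i = rng.integers(n)
            # Variance-reduced estimator: unbiased for the full gradient,
            # with variance that shrinks as x approaches the snapshot.
            v = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x -= lr * v
    return x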
VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning
In this paper, we propose a simple variant of the original SVRG, called variance reduced stochastic gradient descent (VR-SGD). Unlike the choices of snapshot and starting points in SVRG and its proximal variant, Prox-SVRG, the two vectors of VR-SGD are set to the average and last iterate of the previous epoch, respectively. The settings allow us to use much larger learning rates, and also make ...
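The distinguishing choice described above can be made concrete with a small variation of the same loop: the snapshot for the next epoch is the average of the current epoch's iterates, while the next epoch starts from the last iterate. The sketch below is a plausible rendering of that description, not the authors' reference code.

import numpy as np

def vr_sgd(x0, grad_i, n, lr=0.1, epochs=10, inner_steps=100, rng=None):
    # Sketch of the VR-SGD choices described above (illustrative only):
    # snapshot <- average of the previous epoch's iterates,
    # starting point <- last iterate of the previous epoch.
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    snapshot = x.copy()
    for _ in range(epochs):
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        iterates = []
        for _ in range(inner_steps):
            i = rng.integers(n)
            v = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x -= lr * v
            iterates.append(x.copy())
        snapshot = np.mean(iterates, axis=0)  # average of this epoch's iterates
        # x itself (the last iterate) carries over as the next starting point.
    return x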
signSGD: compressed optimisation for non-convex problems
Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. SIGNSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. SIGNSGD can exploit mismatches ...
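The one-bit-per-coordinate idea is easy to state in code: each worker sends only the sign of its minibatch gradient and the server combines the signs, e.g. by elementwise majority vote. The sketch below illustrates a single parameter update under that scheme; the function name and the aggregation step are illustrative assumptions, not the paper's reference implementation.

import numpy as np

def signsgd_step(x, worker_grads, lr=1e-3):
    # worker_grads: array of shape (num_workers, dim), one minibatch
    # stochastic gradient per worker. Only sign(g) would be transmitted,
    # i.e. one bit per coordinate per worker.
    signs = np.sign(worker_grads)
    vote = np.sign(signs.sum(axis=0))   # server: elementwise majority vote
    return x - lr * vote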
Larger is Better: The Effect of Learning Rates Enjoyed by Stochastic Optimization with Progressive Variance Reduction
In this paper, we propose a simple variant of the original stochastic variance reduction gradient (SVRG) [1], which we hereafter refer to as the variance reduced stochastic gradient descent (VR-SGD). Different from the choices of the snapshot point and starting point in SVRG and its proximal variant, Prox-SVRG [2], the two vectors of each epoch in VR-SGD are set to the average and last iterate o...
Annealed Gradient Descent for Deep Learning
Stochastic gradient descent (SGD) has been regarded as a successful optimization algorithm in machine learning. In this paper, we propose a novel annealed gradient descent (AGD) method for non-convex optimization in deep learning. AGD optimizes a sequence of gradually improved smoother mosaic functions that approximate the original non-convex objective function according to an annealing schedul...
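The annealing idea (optimize a sequence of gradually less-smoothed surrogates of the objective) can be sketched generically. Below, Gaussian randomized smoothing stands in for the paper's mosaic approximations, which are constructed differently; treat this as an assumption-laden illustration of the schedule, not the AGD algorithm itself.

import numpy as np

def annealed_descent(x0, grad, sigmas=(1.0, 0.3, 0.1, 0.0),
                     lr=0.05, steps_per_level=200, samples=8, rng=None):
    # Generic annealed descent: follow gradients of progressively
    # less-smoothed approximations of the objective, ending on the
    # original (unsmoothed) function.
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    for sigma in sigmas:                     # annealing schedule
        for _ in range(steps_per_level):
            if sigma > 0:
                # Monte Carlo gradient of the smoothed objective
                # E_u[ f(x + sigma * u) ], u ~ N(0, I).
                g = np.mean([grad(x + sigma * rng.standard_normal(x.shape))
                             for _ in range(samples)], axis=0)
            else:
                g = grad(x)                  # final level: original objective
            x -= lr * g
    return x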
Journal title: CoRR
Volume: abs/1708.08694
Pages: -
Publication date: 2017